PPO Training does not improve SFT model outputs (Metrics identical before and after PPO)

Hi All,

I’m facing an issue where PPO training with trl==0.11.3 on top of my SFT model makes no difference at all to the final evaluation metrics. Here’s the before/after comparison:

=== Summary Metrics ===
| model                      | exact_match | rouge1_f1 | rouge2_f1 | rougeL_f1 | bleu     | meteor | inference_time_sec |
|----------------------------|-------------|-----------|-----------|-----------|----------|--------|--------------------|
| SFT-SeaLLMs-v3-7B-LoRA     | 0.0428571   | 0.657684  | 0.524886  | 0.618101  | 0.418482 | 0.6555 | 46.5213            |
| PPO-SFT-SeaLLMs-v3-7B-LoRA | 0.0428571   | 0.657684  | 0.524886  | 0.618101  | 0.418482 | 0.6555 | 42.8529            |

Issue

Although the PPO loop runs successfully, the final model behaves identically to the original SFT model: rewards vary across training steps and no errors are raised, yet the downstream generation quality is unchanged.

Below is the full training code I’m using (compatible with trl==0.11.3):

#!/usr/bin/env python
# Compatible with trl==0.11.3
# Launch with: accelerate launch --config_file accelerate_config_ppo.yaml train_ppo.py

import os
import json
import math
from functools import lru_cache

import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import (
    set_seed,
    PPOConfig,
    PPOTrainer,
    AutoModelForCausalLMWithValueHead,
    create_reference_model,
)
from trl.core import respond_to_batch
from openai import OpenAI
from tqdm.auto import tqdm

SEED = 42
MODEL_NAME = os.getenv("POLICY_MODEL", "../sft/sft_mp_final")
DATA_PATH = os.getenv("DATASET_PATH", "../datasets/cqa_ppo.jsonl")
LOG_DIR = os.getenv("LOG_DIR", "trainer_output/ppo_logs")
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "ppo_model_final")

set_seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)


ppo_cfg = PPOConfig(
    model_name=MODEL_NAME,
    learning_rate=1e-5,
    batch_size=1,
    world_size=8,
    mini_batch_size=1,
    seed=42,
    ppo_epochs=1,
    kl_penalty="full",
    log_with="tensorboard",
    project_kwargs={"logging_dir": LOG_DIR},
)
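# Note: with batch_size == mini_batch_size == 1 and ppo_epochs == 1 above, each
# ppo_trainer.step() call performs a single optimizer update on one example.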

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, local_files_only=True, use_fast=True)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, local_files_only=True
)
ref_model = create_reference_model(policy_model)
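# create_reference_model returns a frozen copy of the policy; it serves as the
# reference for the KL penalty during PPO.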

raw_ds = load_dataset("json", data_files=DATA_PATH, split="train").select(range(10))

def process_data(raw_ds):
    def make_prompt_response(example):
        prompt = f"{example['context']}\n\nPertanyaan: {example['question']}\nJawaban:"
        response = example['answer']
        return {'prompt': prompt, 'response': response}

    return raw_ds.map(make_prompt_response, remove_columns=raw_ds.column_names)

processed_ds = process_data(raw_ds)

# Reward scoring (body omitted in this post); returns a dict that includes the
# scalar "reward" consumed in the training loop below.
@lru_cache(maxsize=None)
def score_response(prompt: str, response: str, reference: str) -> dict:
    ...

ppo_trainer = PPOTrainer(
    config=ppo_cfg,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

generation_kwargs = {"top_k": 0.0, "top_p": 1.0, "do_sample": True}
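# Pure sampling: top_k=0 and top_p=1.0 disable both filters. No max_new_tokens is
# set here, so the generation length falls back to the model's generation_config.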
device = ppo_trainer.current_device

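# PPO loop: for each example, sample a continuation, score it against the
# reference answer, and take one PPO step.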
for step, sample in enumerate(tqdm(processed_ds, desc="PPO Training")):
    prompt = sample["prompt"]
    reference = sample["response"]

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0).to(device)
    out = ppo_trainer.generate(input_ids, **generation_kwargs).squeeze(0)
    new_tokens = out[input_ids.shape[0]:]
    response_str = tokenizer.decode(new_tokens, skip_special_tokens=True)

    metrics = score_response(prompt, response_str, reference)
    reward_val = metrics["reward"]
    reward_t = torch.tensor(reward_val, device=device)

    stats = ppo_trainer.step([input_ids], [new_tokens], [reward_t])
    tqdm.write(f"Step {step:3d}: reward = {reward_val:.4f}")

# Save the final PPO-tuned model
os.makedirs("ppo_fine_tuned_model", exist_ok=True)
ppo_trainer.save_pretrained("ppo_fine_tuned_model")

If anyone sees anything that might explain why PPO is not affecting the model outputs, I’d really appreciate your help. Thanks in advance!


Hmm… That’s a pretty old version of TRL…

In any case, it’s unlikely that the model output would remain completely unchanged, regardless of whether the fine-tuning was successful or not. My guess is that the weight updates are either never being written back to the checkpoint you evaluate, or the update is silently failing. Or maybe some processing step is missing.
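One quick way to check whether the update ever reached the weights you are evaluating is to diff the saved PPO checkpoint against the SFT checkpoint. A rough sketch, using the paths from your script (adjust them if your evaluation loads from somewhere else):

import torch
from transformers import AutoModelForCausalLM

# Loads both checkpoints on CPU; needs enough RAM for two copies of the model.
sft = AutoModelForCausalLM.from_pretrained("../sft/sft_mp_final", torch_dtype=torch.float16)
ppo = AutoModelForCausalLM.from_pretrained("ppo_fine_tuned_model", torch_dtype=torch.float16)

sft_params = dict(sft.named_parameters())
changed = sum(
    1
    for name, p in ppo.named_parameters()
    if name in sft_params and not torch.equal(sft_params[name], p)
)
print(f"{changed} parameter tensors differ between the SFT and PPO checkpoints")
# If this prints 0, the PPO updates never made it into the weights you evaluate.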

If the same code works with other models, then perhaps the layers are frozen or set to non-trainable…
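You can rule that out with a quick check on policy_model right after loading it in your script, something like:

trainable = sum(p.numel() for p in policy_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in policy_model.parameters())
print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
# If (almost) nothing beyond the value head is trainable, PPO can run and log
# rewards without ever moving the policy.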